pacman::p_load(tmap, sf, sp, DT, stplanr,
performance, reshape2,
ggpubr, tidyverse)Take-Home Exercise 2
Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows
Motivation and Objective
This take-home exercise is motivated by two main reasons. Firstly, despite increasing amounts of open data available for public consumption, there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions.
Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.
Hence, your task for this take-home exercise is to conduct a case study to demonstrate the potential value of GDSA to integrate publicly available data from multiple sources for building a spatial interaction models to determine factors affecting urban mobility patterns of public bus transit.
The Data
Open Government Data
For the purpose of this assignment, data from several open government sources will be used:
Passenger Volume by Origin Destination Bus Stops, Bus Stop Location, Train Station and Train Station Exit Point, just to name a few of them, from LTA DataMall.
Master Plan 2019 Subzone Boundary, HDB Property Information, School Directory and Information and other relevant data from Data.gov.sg.
Specially collected data
- Businesses, retail and services, leisure and recreation, etc geospatial data sets assemble by course instructor. (Refer to eLearn)
The Task
Geospatial Data Science
Derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the traffic analysis zone (TAZ).
With reference to the time intervals provided in the table below, construct an O-D matrix of commuter flows for a time interval of your choice by integrating Passenger Volume by Origin Destination Bus Stops and Bus Stop Location from LTA DataMall. The O-D matrix must be aggregated at the analytics hexagon level
Peak hour period Bus tap on time Weekday morning peak 6am to 9am Weekday afternoon peak 5pm to 8pm Weekend/holiday morning peak 11am to 2pm Weekend/holiday evening peak 4pm to 7pm Display the O-D flows of the passenger trips by using appropriate geovisualisation methods (not more than 5 maps).
Describe the spatial patterns revealed by the geovisualisation (not more than 100 words per visual).
Assemble at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.
Compute a distance matrix by using the analytical hexagon data derived earlier.
Spatial Interaction Modelling
Calibrate spatial interactive models to determine factors affecting urban commuting flows at the selected time interval.
Present the modelling results by using appropriate geovisualisation and graphical visualisation methods. (Not more than 5 visuals)
With reference to the Spatial Interaction Model output tables, maps and data visualisation prepared, describe the modelling results. (not more than 100 words per visual).
FIRST STEP
Load in necessary packages:
As we are aware, there is an increasing amounts of open data available, but there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions. In this section I will be performing integration of various data sources.
I will first check the various database selected to see how is it possible to build a hollistic database.
Data Import, Extraction, Processing
Geospatial Data - Bus Stop
BusStop <- st_read(dsn="data/geospatial",
layer="BusStop")%>%
st_transform(crs = 3414)Reading layer `BusStop' from data source
`C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
I see that BusStop is a point geometry SF with “BUS_STOP_N”, “BUS_ROOF_N”, “LOC_DESC” and its geometry.
MPSZ <- st_read(dsn="data/geospatial",
layer="MPSZ-2019")%>%
st_transform(crs = 3414)Reading layer `MPSZ-2019' from data source
`C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
I see that MPSZ is a Multipolygon geometry SF with “SUBZONE_N”, “SUBZONE_C”, “PLN_AREA_N”, “PLN_AREA_C”, “REGION_N”, “REGION_C” and its geometry.
Creating Hexagon
Traffic Analysis Zone (TAZ)
Traffic analysis zones are universally used in travel demand modelling to represent the spatial distribution of trip1 origins and destinations, as well as the population, employment and other spatial attributes that generate or otherwise influence travel demand. The urban area is divided into a set of mutually exclusive and collectively exhaustive zones. While travel actually occurs from one point in the urban region to another, all trip origins and destinations in a travel demand model are represented at the spatially aggregate level of the movement from an origin zone to a destination zone. These movements are further aggregated within network assignment models as originating and ending at single points within the origin and destination zones – the zone centroids.
As we should derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the TAZ,
Per the documentation of st_make_grid:
cellsize is numeric of length 1 or 2 with target cellsize: ..for hexagonal cells the distance between opposite edges (edge length is cellsize/sqrt(3)). A length units object can be passed, or an area unit object with area size of the square or hexagonal cell.
In this case, the distance between opposite edges should be 325m * 2. And the length of hexagon should be 325m.
BusStop_hexagon_grid = st_make_grid(BusStop, 325, what = "polygons", square = FALSE)
BusStop_hexagon_sf = st_sf(geometry = BusStop_hexagon_grid) %>%
# add grid ID
mutate(grid_id = 1:length(lengths(BusStop_hexagon_grid)))We will next use st_intersection() for point and polygon overlay, to combine the data sets. This will provide us with output in point sf object.
BusStop_hexagon <- st_intersection(BusStop_hexagon_sf, BusStop) %>%
select(1,2,4) %>%
st_drop_geometry()MPSZ_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ) %>%
select(1:3)
BusStop_MPSZ_hexagon <- left_join(BusStop_hexagon,MPSZ_hexagon,
by = "grid_id")Note this BusStop_MPSZ_hexagon only contain the hexagon which has got bus stop within the area.
write_rds(BusStop_hexagon, "data/rds/BusStop_hexagon.rds")Performing the Relational Join (to update attribute table of one geospatial with another aspatial data set)
Now we will next combine this onto our our odbs_peak data frame which shows the total number of trips from particular bus stop during peak hour.
Aspatial Data
Next we import the Aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION BUS STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.
OD_bus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
#non-spatial data with no geometry featuresOD_bus$ORIGIN_PT_CODE <- as.factor(OD_bus$ORIGIN_PT_CODE)
OD_bus$DESTINATION_PT_CODE <- as.factor(OD_bus$DESTINATION_PT_CODE) Noted that the aspatial data for bus has total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.
For the purpose of this exercise, we only look at those Bus tap on time on Weekday morning peak, between 6am to 9am.
OD_bus_peak <- OD_bus %>%
filter(DAY_TYPE == "WEEKDAY" &
(TIME_PER_HOUR >=6 & TIME_PER_HOUR <=9))
summary(OD_bus_peak) YEAR_MONTH DAY_TYPE TIME_PER_HOUR PT_TYPE
Length:669211 Length:669211 Min. :6.000 Length:669211
Class :character Class :character 1st Qu.:7.000 Class :character
Mode :character Mode :character Median :8.000 Mode :character
Mean :7.595
3rd Qu.:9.000
Max. :9.000
ORIGIN_PT_CODE DESTINATION_PT_CODE TOTAL_TRIPS
22009 : 1925 22009 : 1803 Min. : 1.00
84009 : 1817 52009 : 1785 1st Qu.: 2.00
75009 : 1757 84009 : 1732 Median : 7.00
52009 : 1678 75009 : 1658 Mean : 39.49
59009 : 1656 59009 : 1534 3rd Qu.: 24.00
46009 : 1535 46009 : 1506 Max. :29151.00
(Other):658843 (Other):659193
Computing the distance matrix
We use the spDists() coz faster.
BusStop_hexagon_sp <- as(BusStop_hexagon_sf, "Spatial")dist <- spDists(BusStop_hexagon_sp,
longlat = FALSE)
#because or subzone is already svr21. Otherwise it will treat data as x and y, then calculate great circle distance.
head(dist, n=c(10, 10)) #only list first 10 col and 10 rows [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.0000 562.9165 1125.8330 1688.7495 2251.6660 2814.5826 3377.4991
[2,] 562.9165 0.0000 562.9165 1125.8330 1688.7495 2251.6660 2814.5826
[3,] 1125.8330 562.9165 0.0000 562.9165 1125.8330 1688.7495 2251.6660
[4,] 1688.7495 1125.8330 562.9165 0.0000 562.9165 1125.8330 1688.7495
[5,] 2251.6660 1688.7495 1125.8330 562.9165 0.0000 562.9165 1125.8330
[6,] 2814.5826 2251.6660 1688.7495 1125.8330 562.9165 0.0000 562.9165
[7,] 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330 562.9165 0.0000
[8,] 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330 562.9165
[9,] 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330
[10,] 5066.2486 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495
[,8] [,9] [,10]
[1,] 3940.4156 4503.3321 5066.2486
[2,] 3377.4991 3940.4156 4503.3321
[3,] 2814.5826 3377.4991 3940.4156
[4,] 2251.6660 2814.5826 3377.4991
[5,] 1688.7495 2251.6660 2814.5826
[6,] 1125.8330 1688.7495 2251.6660
[7,] 562.9165 1125.8330 1688.7495
[8,] 0.0000 562.9165 1125.8330
[9,] 562.9165 0.0000 562.9165
[10,] 1125.8330 562.9165 0.0000
Notice that the output dist is a matrix object class of R. Also notice that the column heanders and row headers are not labeled with the planning subzone codes.
Labelling column and row headers of a distance matrix
First, we will create a list sorted according to the the distance matrix by planning sub-zone code.
grid <- BusStop_hexagon_sf$grid_idNext we will attach SUBZONE_C to row and column for distance matrix matching ahead
colnames(dist) <- paste0(grid)
rownames(dist) <- paste0(grid)Pivoting distance value by SUBZONE_C
Next, we will pivot the distance matrix into a long table by using the row and column subzone codes as show in the code chunk below.
distPair <- melt(dist) %>%
rename(dist = value)
head(distPair, 10) Var1 Var2 dist
1 1 1 0.0000
2 2 1 562.9165
3 3 1 1125.8330
4 4 1 1688.7495
5 5 1 2251.6660
6 6 1 2814.5826
7 7 1 3377.4991
8 8 1 3940.4156
9 9 1 4503.3321
10 10 1 5066.2486
Note that melt() is a old reshape tool function, that take dist matrix and convert it to long table, 1. origin, 2. destination, 3. distance matrix
Notice that the within zone distance is 0.
Updating intra-zonal distances
Now I am going to append a constant value to replace the intra-zonal distance of 0, first select and find out the minimum value of the distance by using summary().
distPair <- distPair %>%
rename(orig = Var1,
dest = Var2)distPair %>%
filter(dist > 0) %>%
summary() orig dest dist
Min. : 1 Min. : 1 Min. : 325
1st Qu.: 3301 1st Qu.: 3301 1st Qu.:11500
Median : 6600 Median : 6600 Median :18086
Mean : 6600 Mean : 6600 Mean :18991
3rd Qu.: 9900 3rd Qu.: 9900 3rd Qu.:25526
Max. :13200 Max. :13200 Max. :51798
write_rds(distPair, "data/rds/distPair.rds") Preparing flow data
Note that although I would want to keep all the fields intact as I wasnt sure what are the data I need for the next steps. It is taking too much processing power and space, so i will summarize and aggregate the value of selected time.
BusStop_Trips <- left_join(OD_bus_peak, BusStop_hexagon, #the left join is so to get grid ID from hexagon file
by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
drop_na() %>%
rename(ORIGIN_BS = ORIGIN_PT_CODE,
ORIGIN_GRID = grid_id,
DESTIN_BS = DESTINATION_PT_CODE
)%>%
group_by(ORIGIN_GRID, ORIGIN_BS, DESTIN_BS) %>%
summarize(TRIPS = sum(TOTAL_TRIPS))But there isn’t grid ID for my destination so i will now do the following to get complete picture:
BusStop_Trips <- BusStop_Trips %>%
left_join(BusStop_hexagon, #the left join is so to get grid ID from hexagon file
by = c("DESTIN_BS" = "BUS_STOP_N")) %>%
drop_na() %>%
rename(DESTIN_GRID = grid_id)%>%
select(1,2,5,3,4)BusStop_Trips <- unique(BusStop_Trips)Visualising Spatial Interaction
Separating intra-flow from passenger volume df
To add three new fields in BusStop_Trips dataframe, that is to be used for flow.
str(BusStop_Trips)gropd_df [236,683 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
$ ORIGIN_GRID: int [1:236683] 52 52 52 52 52 52 149 149 149 200 ...
$ ORIGIN_BS : chr [1:236683] "25059" "25059" "25059" "25059" ...
$ DESTIN_GRID: int [1:236683] 783 345 295 580 630 735 783 345 735 2079 ...
$ DESTIN_BS : chr [1:236683] "24719" "25709" "25719" "25771" ...
$ TRIPS : num [1:236683] 54 2 1 2 1 2 34 13 3 2 ...
- attr(*, "groups")= tibble [4,962 × 3] (S3: tbl_df/tbl/data.frame)
..$ ORIGIN_GRID: int [1:4962] 52 149 200 201 245 295 295 295 298 342 ...
..$ ORIGIN_BS : chr [1:4962] "25059" "25751" "26379" "26369" ...
..$ .rows : list<int> [1:4962]
.. ..$ : int [1:6] 1 2 3 4 5 6
.. ..$ : int [1:3] 7 8 9
.. ..$ : int [1:5] 10 11 12 13 14
.. ..$ : int [1:8] 15 16 17 18 19 20 21 22
.. ..$ : int [1:9] 23 24 25 26 27 28 29 30 31
.. ..$ : int [1:8] 32 33 34 35 36 37 38 39
.. ..$ : int [1:10] 40 41 42 43 44 45 46 47 48 49
.. ..$ : int [1:14] 50 51 52 53 54 55 56 57 58 59 ...
.. ..$ : int [1:10] 64 65 66 67 68 69 70 71 72 73
.. ..$ : int [1:7] 74 75 76 77 78 79 80
.. ..$ : int [1:6] 81 82 83 84 85 86
.. ..$ : int [1:7] 87 88 89 90 91 92 93
.. ..$ : int [1:11] 94 95 96 97 98 99 100 101 102 103 ...
.. ..$ : int [1:16] 105 106 107 108 109 110 111 112 113 114 ...
.. ..$ : int [1:7] 121 122 123 124 125 126 127
.. ..$ : int [1:12] 128 129 130 131 132 133 134 135 136 137 ...
.. ..$ : int [1:17] 140 141 142 143 144 145 146 147 148 149 ...
.. ..$ : int [1:8] 157 158 159 160 161 162 163 164
.. ..$ : int [1:12] 165 166 167 168 169 170 171 172 173 174 ...
.. ..$ : int [1:8] 177 178 179 180 181 182 183 184
.. ..$ : int [1:14] 185 186 187 188 189 190 191 192 193 194 ...
.. ..$ : int [1:9] 199 200 201 202 203 204 205 206 207
.. ..$ : int [1:4] 208 209 210 211
.. ..$ : int [1:9] 212 213 214 215 216 217 218 219 220
.. ..$ : int [1:7] 221 222 223 224 225 226 227
.. ..$ : int [1:5] 228 229 230 231 232
.. ..$ : int [1:7] 233 234 235 236 237 238 239
.. ..$ : int [1:7] 240 241 242 243 244 245 246
.. ..$ : int 247
.. ..$ : int [1:4] 248 249 250 251
.. ..$ : int [1:5] 252 253 254 255 256
.. ..$ : int [1:5] 257 258 259 260 261
.. ..$ : int [1:20] 262 263 264 265 266 267 268 269 270 271 ...
.. ..$ : int [1:4] 282 283 284 285
.. ..$ : int [1:13] 286 287 288 289 290 291 292 293 294 295 ...
.. ..$ : int [1:4] 299 300 301 302
.. ..$ : int [1:7] 303 304 305 306 307 308 309
.. ..$ : int [1:8] 310 311 312 313 314 315 316 317
.. ..$ : int [1:8] 318 319 320 321 322 323 324 325
.. ..$ : int [1:4] 326 327 328 329
.. ..$ : int [1:14] 330 331 332 333 334 335 336 337 338 339 ...
.. ..$ : int [1:6] 344 345 346 347 348 349
.. ..$ : int [1:6] 350 351 352 353 354 355
.. ..$ : int [1:5] 356 357 358 359 360
.. ..$ : int [1:10] 361 362 363 364 365 366 367 368 369 370
.. ..$ : int [1:6] 371 372 373 374 375 376
.. ..$ : int [1:6] 377 378 379 380 381 382
.. ..$ : int [1:4] 383 384 385 386
.. ..$ : int [1:16] 387 388 389 390 391 392 393 394 395 396 ...
.. ..$ : int [1:5] 403 404 405 406 407
.. ..$ : int [1:9] 408 409 410 411 412 413 414 415 416
.. ..$ : int [1:7] 417 418 419 420 421 422 423
.. ..$ : int [1:5] 424 425 426 427 428
.. ..$ : int [1:3] 429 430 431
.. ..$ : int [1:6] 432 433 434 435 436 437
.. ..$ : int [1:8] 438 439 440 441 442 443 444 445
.. ..$ : int [1:2] 446 447
.. ..$ : int [1:9] 448 449 450 451 452 453 454 455 456
.. ..$ : int [1:3] 457 458 459
.. ..$ : int [1:4] 460 461 462 463
.. ..$ : int [1:4] 464 465 466 467
.. ..$ : int [1:5] 468 469 470 471 472
.. ..$ : int [1:4] 473 474 475 476
.. ..$ : int [1:32] 477 478 479 480 481 482 483 484 485 486 ...
.. ..$ : int [1:5] 509 510 511 512 513
.. ..$ : int [1:42] 514 515 516 517 518 519 520 521 522 523 ...
.. ..$ : int [1:8] 556 557 558 559 560 561 562 563
.. ..$ : int [1:7] 564 565 566 567 568 569 570
.. ..$ : int [1:5] 571 572 573 574 575
.. ..$ : int [1:5] 576 577 578 579 580
.. ..$ : int [1:3] 581 582 583
.. ..$ : int [1:3] 584 585 586
.. ..$ : int [1:17] 587 588 589 590 591 592 593 594 595 596 ...
.. ..$ : int [1:3] 604 605 606
.. ..$ : int [1:106] 607 608 609 610 611 612 613 614 615 616 ...
.. ..$ : int [1:7] 713 714 715 716 717 718 719
.. ..$ : int [1:12] 720 721 722 723 724 725 726 727 728 729 ...
.. ..$ : int 732
.. ..$ : int [1:3] 733 734 735
.. ..$ : int [1:6] 736 737 738 739 740 741
.. ..$ : int [1:3] 742 743 744
.. ..$ : int [1:25] 745 746 747 748 749 750 751 752 753 754 ...
.. ..$ : int [1:9] 770 771 772 773 774 775 776 777 778
.. ..$ : int [1:34] 779 780 781 782 783 784 785 786 787 788 ...
.. ..$ : int [1:10] 813 814 815 816 817 818 819 820 821 822
.. ..$ : int 823
.. ..$ : int [1:41] 824 825 826 827 828 829 830 831 832 833 ...
.. ..$ : int [1:4] 865 866 867 868
.. ..$ : int [1:11] 869 870 871 872 873 874 875 876 877 878 ...
.. ..$ : int [1:5] 880 881 882 883 884
.. ..$ : int [1:2] 885 886
.. ..$ : int [1:2] 887 888
.. ..$ : int [1:9] 889 890 891 892 893 894 895 896 897
.. ..$ : int [1:4] 898 899 900 901
.. ..$ : int 902
.. ..$ : int [1:3] 903 904 905
.. ..$ : int [1:10] 906 907 908 909 910 911 912 913 914 915
.. ..$ : int [1:3] 916 917 918
.. ..$ : int [1:6] 919 920 921 922 923 924
.. .. [list output truncated]
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
This above checks show that my ORIGIN_BS column is chr, therefore i will now convert it to factor:
BusStop_Trips$ORIGIN_BS <- as.factor(BusStop_Trips$ORIGIN_BS)BusStop_Trips$FlowNoIntra <- ifelse(
BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID,
0, BusStop_Trips$TRIPS)
BusStop_Trips$offset <- ifelse(
BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID,
0.000001, 1)Combining passenger volume data with distance value
BusStop_Trips1 <- BusStop_Trips %>%
left_join (distPair,
by = c("ORIGIN_GRID" = "orig",
"DESTIN_GRID" = "dest"))summary(BusStop_Trips1) ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS
Min. : 52 22009 : 591 Min. : 52 Length:236683
1st Qu.: 5846 84009 : 564 1st Qu.: 5956 Class :character
Median : 7637 75009 : 539 Median : 7602 Mode :character
Mean : 7473 52009 : 538 Mean : 7457
3rd Qu.: 9087 64009 : 451 3rd Qu.: 9035
Max. :13117 46009 : 444 Max. :13117
(Other):233556
TRIPS FlowNoIntra offset dist
Min. : 1.0 Min. : 0.0 Min. :0.000001 Min. : 0
1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.:1.000000 1st Qu.: 1810
Median : 15.0 Median : 15.0 Median :1.000000 Median : 3748
Mean : 109.5 Mean : 109.4 Mean :0.995475 Mean : 4865
3rd Qu.: 58.0 3rd Qu.: 58.0 3rd Qu.:1.000000 3rd Qu.: 6917
Max. :96630.0 Max. :96630.0 Max. :1.000000 Max. :24758
From the summary above, it is noted that there are some origin and destination bus stops within the same hexagon grid. For the purpose of this exercise we will exclude them, as the aim is to find out the interozonal (hexagon) flow.
BusStop_Trips2 <- BusStop_Trips1 %>%
filter(dist>0)
summary(BusStop_Trips2) ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS
Min. : 52 22009 : 591 Min. : 52 Length:235612
1st Qu.: 5866 84009 : 563 1st Qu.: 5956 Class :character
Median : 7639 75009 : 539 Median : 7602 Mode :character
Mean : 7475 52009 : 538 Mean : 7459
3rd Qu.: 9087 64009 : 451 3rd Qu.: 9035
Max. :13117 46009 : 443 Max. :13117
(Other):232487
TRIPS FlowNoIntra offset dist
Min. : 1.0 Min. : 1.0 Min. :1 Min. : 325
1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.:1 1st Qu.: 1810
Median : 15.0 Median : 15.0 Median :1 Median : 3748
Mean : 109.9 Mean : 109.9 Mean :1 Mean : 4887
3rd Qu.: 59.0 3rd Qu.: 59.0 3rd Qu.:1 3rd Qu.: 6917
Max. :96630.0 Max. :96630.0 Max. :1 Max. :24758
Now we see the minimum is 325, which is the the perpendicular distance between the centre of the hexagon and its edges (next hexagon).
Visualization for the TOTAL TRIPS taken at Origin and Destination hexagon area
origin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
by =c("grid_id" = "ORIGIN_GRID")) %>%
drop_na() %>%
rename(ORIGIN_GRID = grid_id) %>%
group_by(ORIGIN_GRID) %>%
summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))
destin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
by =c("grid_id" = "DESTIN_GRID")) %>%
drop_na() %>%
rename(DESTIN_GRID = grid_id) %>%
group_by(DESTIN_GRID) %>%
summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))We will just narrow down to look at number of trips above 100k during this period.
tmap_mode("view")
#tmap_options(check.and.fix = TRUE)
map_honeycomb = tm_shape(origin_density_map %>%
filter(TOTAL_TRIPS>100000)) +
tm_fill(
col = "TOTAL_TRIPS",
palette = "Reds",
style = "cont",
title = "Number of Trips by Origin Bus Stop within the Area",
alpha = 0.4,
popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_scale_bar()
map_honeycomb1 = tm_shape(destin_density_map %>%
filter(TOTAL_TRIPS>100000))+
tm_fill("TOTAL_TRIPS",
style = "cont",
palette = "Reds",
title = "Number of Trips by Destination Bus Stop within the Area",
alpha = 0.4,
popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_scale_bar()
tmap_arrange(map_honeycomb, map_honeycomb1, asp=2, ncol=2)Insights
It is quite surprising to see that there isn’t much alight activity amongst the bus stataion in CBD area, given that this is Weekday morning peak, between 6am to 9am. However, having another thought, this timing is probably more relevant to students going to school (since classes start early from 7 to 8am). Likely the CBD crowd only kicks in from 8-9am, which is significantly lesser in comparison to the whole dataset. We can dive down further (segregate the timing e.g. do one 6-8, one 7-9, to see any difference in results) as the next project.
An interesting sight is that grid_id 5700 has high number of BOTH incoming and outgoing traffic, as reflected by the 301.699 total number of trips as origin area, and 422,497 total number of trips as destination area. This area is likely to be either a bus interchange, or a popular bus stop.
BusStop_Trips1[BusStop_Trips1$ORIGIN_GRID == 5700,]# A tibble: 529 × 8
# Groups: ORIGIN_GRID, ORIGIN_BS [3]
ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS TRIPS FlowNoIntra offset dist
<int> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 5700 46009 7833 01019 70 70 1 17120.
2 5700 46009 7930 01059 175 175 1 16759.
3 5700 46009 7881 01119 190 190 1 16937.
4 5700 46009 7880 02049 21 21 1 17444.
5 5700 46009 8024 02089 110 110 1 17910.
6 5700 46009 8120 02099 1 1 1 18057.
7 5700 46009 7786 07539 99 99 1 16289.
8 5700 46009 7882 07589 55 55 1 16434.
9 5700 46009 6675 14009 183 183 1 19164.
10 5700 46009 6627 14139 16 16 1 18858.
# ℹ 519 more rows
We noted the bus stop is 46009
BusStop[BusStop$BUS_STOP_N == 46009,]Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 22836.28 ymin: 46567.63 xmax: 22836.28 ymax: 46567.63
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
2178 46009 INT WOODLANDS REG INT POINT (22836.28 46567.63)
From the above, we found out that this bus stop is a Interchange station in Woodlands, a densely populated area.
Similarly, we have also noted a few grids where BOTH incoming and outgoing traffic are relatively high. To draw a similar analysis:
list <- intersect(origin_density_map$ORIGIN_GRID[origin_density_map$TOTAL_TRIPS>100000]
, destin_density_map$DESTIN_GRID[destin_density_map$TOTAL_TRIPS>100000])So we know that the grid with >100k BOTH incoming and outgoing traffic are:
list [1] 2993 4250 4908 5700 7648 7703 8467 9532 10286 10772
Next we find out where are these grid:
BusStop_hexagon[BusStop_hexagon$grid_id %in% list, "LOC_DESC"] [1] "TAMPINES INT" "SERANGOON INT"
[3] "TAMPINES INT" "S'GOON STN EXIT A / BLK 413"
[5] "WOODLANDS REG INT" "BOON LAY INT"
[7] "S'GOON STN EXIT C / BLK 201" "BEF PUNGGOL RD"
[9] "CLEMENTI STN" "CHOA CHU KANG INT"
[11] "CLEMENTI STN" "S'GOON STN EXIT E"
[13] "LOT 1 / CHOA CHU KANG STN" "AFT ANG MO KIO STN EXIT A"
[15] "AFT PUNGGOL RD" "TOA PAYOH BUS INT"
[17] "TOA PAYOH BUS INT" "BEDOK BUS INT"
[19] "BEDOK BUS INT" "S'GOON STN EXIT H"
[21] "ANG MO KIO INT" "Aft Tampines Stn Exit E"
[23] "ANG MO KIO STN" "W'LANDS STN EXIT 4"
[25] "W'LANDS STN EXIT 5" "Tampines Stn Exit D"
[27] "OPP CHOA CHU KANG STN" "BEDOK STN EXIT A"
Insights
Noted these are mainly interchange or MRT station so probably make sense as people may change to bus / train at these spot, and there may also be larger number of buses available in these station.
Visualization for the Average distance (weighted) taken at Origin and Destination hexagon area
Now we look at the total average weighted distance traveled.
summary(origin_density_map) ORIGIN_GRID TOTAL_TRIPS AVERAGE_DIST geometry
Min. : 52 Min. : 1 Min. : 325 POLYGON :2451
1st Qu.: 4688 1st Qu.: 775 1st Qu.: 1517 epsg:3414 : 0
Median : 7091 Median : 4603 Median : 2188 +proj=tmer...: 0
Mean : 6762 Mean : 10579 Mean : 2426
3rd Qu.: 8908 3rd Qu.: 13280 3rd Qu.: 3051
Max. :13117 Max. :365868 Max. :14379
summary(destin_density_map) DESTIN_GRID TOTAL_TRIPS AVERAGE_DIST geometry
Min. : 52 Min. : 1 Min. : 325 POLYGON :2456
1st Qu.: 4685 1st Qu.: 1543 1st Qu.: 1418 epsg:3414 : 0
Median : 7089 Median : 4246 Median : 2099 +proj=tmer...: 0
Mean : 6759 Mean : 10557 Mean : 2435
3rd Qu.: 8906 3rd Qu.: 10514 3rd Qu.: 3062
Max. :13117 Max. :422497 Max. :13993
We see the mean is around 2500 average distance, and max of around 14,000 distance. As a gauge we will look at data above 8,000 (8km) average distance:
tmap_mode("view")
#tmap_options(check.and.fix = TRUE)
map_honeycomb2 = tm_shape(origin_density_map %>%
filter(AVERAGE_DIST>8000)) +
tm_fill(
col = "AVERAGE_DIST",
palette = "Reds",
style = "cont",
title = "Average Distance by Origin Bus Stop within the Area",
alpha = 0.4,
popup.vars = c("Average DISTANCE travelled from this origin bus stop: " = "AVERAGE_DIST"),
popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_scale_bar()
map_honeycomb3 = tm_shape(destin_density_map %>%
filter(AVERAGE_DIST>8000)) +
tm_fill("AVERAGE_DIST",
style = "cont",
palette = "Reds",
title = "Number of Trips by Destination Bus Stop within the Area",
alpha = 0.4,
popup.vars = c("Average DISTANCE travelled to this destination bus stop: " = "AVERAGE_DIST"),
popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_scale_bar()
tmap_arrange(map_honeycomb2, map_honeycomb3, asp=2, ncol=2)#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE ORIGIN BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% origin_density_map$ORIGIN_GRID[origin_density_map$AVERAGE_DIST>8000], "LOC_DESC"] [1] "CHANGI AIRPORT PTB1" "AFT CHANGI AIRPORT PTB2"
[3] "AFT LIM CHU KANG LANE 8" "CHANGI AIRPORT PTB3"
[5] "CHANGI AIRPORT PTB2" "PROLOGIS"
[7] "DHL" "BEF CP E1"
[9] "YISHUN DAIRY FARM" "BEF CP D5"
[11] "SIGLAP LK" "Aft Mandai Rd"
[13] "POLICE COAST GUARD" "BLK 402"
[15] "AFT CHANGI FERRY RD" "OPP CASABLANCA"
[17] "BEF CHANGI FERRY RD" "EXPEDITORS"
[19] "OPP MANDARIN GDNS" "BAYFRONT STN EXIT B/MBS"
#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE DESTINATION BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% destin_density_map$DESTIN_GRID[destin_density_map$AVERAGE_DIST>8000], "LOC_DESC"] [1] "ST ANNE'S CH" "AIRPORT POLICE STN"
[3] "AFT DAIRY FARM RD" "DHOBY GHAUT STN"
[5] "CHANGI AIRPORT PTB1" "THE SAIL"
[7] "DOWNTOWN STN" "MARINA BAY FINANCIAL CTR"
[9] "AFT LIM CHU KANG LANE 8" "AFT SLE"
[11] "CHANGI AIRPORT PTB3" "CHANGI AIRPORT PTB2"
[13] "PROLOGIS" "DHL"
[15] "TANJONG PAGAR STN EXIT C" "BEF CHANGI AIRPORT PTB3"
[17] "MAPLETREE ANSON" "POLICE COAST GUARD"
[19] "CHANGI NAVAL BASE" "OPP CHANGI NAVAL BASE"
[21] "BEF STEVENS STN EXIT 4" "AFT STEVENS STN EXIT 5"
[23] "AFT CHANGI FERRY RD" "CASABLANCA"
[25] "LP 94" "ORCHARD STN / TANGS PLAZA"
[27] "Marina Bay Stn" "BEF YISHUN AVE 1"
[29] "BEF CHANGI FERRY RD" "OPP DB SCHENKER"
[31] "OPP ORCHARD STN / ION" "YMCA"
[33] "AFT DHOBY GHAUT STN" "MENLO WORLDWIDE"
[35] "EXPEDITORS" "DHOBY GHAUT STN EXIT B"
[37] "NEAR SATS FLIGHT KITCHEN"
Insights
Another interesting observation. For this we will just omit bus stop to and from changi airport, as it can be understood that bus to and from airport are typically of longer distance. The same applies to Changi Naval base/ Opp Changi Naval base as the bus stop is probably serviced by a long distance bus.
The origin bus stops that meet this criteria of >8km average distance are mainly from woodlands/ yishun / causeway. There are also people taking long distance bus and alight at central/ MBFC /Mapletree / CBD /Seletar / Town area, mainly for work.
Visualization of OD FLOW
Creating desire lines
Read the documentation on od2line here
simpleBS_hexagon <- left_join(BusStop_Trips, BusStop_hexagon_sf,
by = c("ORIGIN_GRID" = "grid_id")) %>%
drop_na() %>%
group_by(ORIGIN_GRID) %>%
select(8,1)
simpleBS_hexagon <- unique(simpleBS_hexagon)
simpleBS_hexagon <- st_sf(geometry = simpleBS_hexagon)BS_flowLine <- BusStop_Trips1 %>%
drop_na() %>%
group_by(ORIGIN_GRID, DESTIN_GRID) %>%
summarize(TOTAL_TRIPS = sum(TRIPS)) %>%
filter(TOTAL_TRIPS>1000)
BS_flowLine <- unique(BS_flowLine)flowLine <- od2line(flow = BS_flowLine,
origin_code = "ORIGIN_GRID",
dest_code = "DESTIN_GRID",
zones = BusStop_hexagon_sf,
zone_code = "grid_id") Visualising the desire lines
To dive down, we will look at OD visualizatiion for total trips >1,000 during the timeframe.
tmap_options(check.and.fix = TRUE)
tm_shape(MPSZ) +
tm_polygons() + tm_fill(alpha = 0.1) +
tm_shape(simpleBS_hexagon) +
tm_polygons() +
flowLine %>%
tm_shape() +
tm_lines(lwd = "TOTAL_TRIPS",
style = "quantile",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3)It appears that the highest number of long OD flow is between north and east, ALso noted that the dense activity areas, per the above analysis where we noted the high total trips per origin/destination, are also closley linked as one of the OD flow. These areas are typically the interchange or major stops.
Next we will look at the total trips >10,000 during the specific period.
tm_shape(MPSZ) +
tm_polygons() +
tm_shape(simpleBS_hexagon) +
tm_polygons() +
flowLine %>%
filter(TOTAL_TRIPS>10000)%>%
tm_shape() +
tm_lines(lwd = "TOTAL_TRIPS",
style = "quantile",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.8)This is quite interesting, as we see the most obvious OD flow is between the woodland checkpoint to woodland interchange / mrt station area. It is also interesting to see there is long distance OD from woodlands to the major Punggol/seletar area.
tmap_mode("plot")Assembling VARIABLES
Next moving on to the assemblling at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.
We will look at the following:
Population of Age 7 to 12
School (Primary School) Types
School distinctive programme
HDB Information - Occupancy rate
MPSZ
pop <- read_csv("data/aspatial/pop.csv")We will focus on age 7 to 12 therefore will remove the other data set, and remove those zero values.
pop <- pop %>%
select(1:3) %>%
filter(AGE7_12>0)This will gives us the PA, SZ, and Total number of Age7_12.
Next we look into the primary school
school <- read_csv("data/aspatial/Generalinformationofschools.csv")Note we dont need so many different columns, and we only want to look at primary school:
school <- school %>%
filter(mainlevel_code == "PRIMARY") %>%
select(1,3:4,10:11,19:31)write_rds(school, "data/rds/PrimarySchool.csv") next will look into the vairous school distinctive programme
distinctive <- read_csv("data/aspatial/SchoolDistinctiveProgrammes.csv")we will just look into the alp_domain for all primary school
table(distinctive$alp_domain)
Aesthetics Business & Entrepreneurship
7 5
Coding Humanities
11 3
ICT & Media Innovation & Enterprise
8 6
Interdisciplinary Languages
40 34
Languages & Humanities Mathematics
19 2
NULL Science
55 27
STEM
71
Combining this to our primary school data
primary_school <- left_join(school,distinctive) There are three school without the corresponding distinctive, this is likely due to the primary school is new, we will confirm on this later. The school database is updated Nov 2023, while the data in distinctive was updated in 2021.
The few distinctive programme will serve as key variables in this dataset.
Next we will go back to pop data set:
pop <- left_join(pop, MPSZ,
by =c("SZ" = "SUBZONE_N"))We do the same to the primary school
primary_school_MSPZ <- left_join(primary_school, MPSZ,
by =c("dgp_code" = "PLN_AREA_N"))Noted that the data got NA for those dgp_code = SENG KANG, due to discrepencies in format. Therefore we will amend this and rerun
primary_school$dgp_code[primary_school$dgp_code == "SENG KANG"] = "SENGKANG"primary_school_MSPZ <- left_join(primary_school, MPSZ,
by =c("dgp_code" = "PLN_AREA_N"))%>%
select(1,4:5,26:27,8,19:23,30)Now we have a primary_school_MSPZ data set that show each primary school, their distinctive programme, and the associated SZ location,and the pop data set that has population aged 7 to 12 by SZ location.
write_rds(primary_school_MSPZ, "data/rds/primary_school_MSPZ.csv") We will look into this later.
Preparing Origin and Destination Attributes
Preparing origin attribute
pop <- read_csv("data/aspatial/pop.csv")pop <- pop %>%
left_join(MPSZ,
by = c("PA" = "PLN_AREA_N",
"SZ" = "SUBZONE_N")) %>%
select(1:6) %>%
rename(SZ_NAME = SZ,
SZ = SUBZONE_C)mpsz_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ) %>%
drop_na()flow_data <- left_join(BusStop_Trips1,mpsz_hexagon,
by =c("ORIGIN_GRID" = "grid_id")) %>%
select(1:10)%>%
rename(ORIGIN_SZ = SUBZONE_C,
ORIGIN_SZ_NAME = SUBZONE_N)
flow_data <- unique(flow_data)flow_data <- left_join(flow_data,mpsz_hexagon,
by =c("DESTIN_GRID" = "grid_id")) %>%
select(1:12)%>%
rename(DESTIN_SZ = SUBZONE_C,
DESTIN_SZ_NAME = SUBZONE_N)
flow_data <- unique(flow_data)flow_data1 <- flow_data %>%
left_join(pop,
by = c(ORIGIN_SZ = "SZ")) %>%
rename(ORIGIN_AGE7_12 = AGE7_12,
ORIGIN_AGE13_24 = AGE13_24,
ORIGIN_AGE25_64 = AGE25_64) %>%
select(-c(PA, SZ_NAME))flow_data1 <- flow_data1 %>%
left_join(pop,
by = c(DESTIN_SZ = "SZ")) %>%
rename(DESTIN_AGE7_12 = AGE7_12,
DESTIN_AGE13_24 = AGE13_24,
DESTIN_AGE25_64 = AGE25_64) %>%
select(-c(PA, SZ_NAME))flow_data1 <- unique(flow_data1)write_rds(flow_data1, "data/rds/SIM_data")Calibrating Spatial Interaction Models
Visualising the dependent variable (TRIPS)
ggplot(data = flow_data1,
aes(x = TRIPS)) +
geom_histogram()
Notice that the distribution is highly skewed and not resemble bell shape or also known as normal distribution.
Next, let us visualise the relation between the dependent variable and one of the key independent variable in Spatial Interaction Model, namely distance.
ggplot(data = flow_data1,
aes(x = dist,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
Notice that their relationship hardly resemble linear relationship.
On the other hand, if we plot the scatter plot by using the log transformed version of both variables, we can see that their relationship is more resemble linear relationship.
ggplot(data = flow_data1,
aes(x = log(dist),
y = log(TRIPS))) +
geom_point() +
geom_smooth(method = lm)
Checking for variables with zero values
Since Poisson Regression is based of log and log 0 is undefined, it is important for us to ensure that no 0 values in the explanatory variables.
In the code chunk below, summary() of Base R is used to compute the summary statistics of all variables:
summary(flow_data1) ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS
Min. : 52 54009 : 3448 Min. : 52 Length:985230
1st Qu.: 5916 84009 : 3132 1st Qu.: 6056 Class :character
Median : 7473 46009 : 2565 Median : 7494 Mode :character
Mean : 7293 17009 : 2349 Mean : 7287
3rd Qu.: 8651 60179 : 2345 3rd Qu.: 8517
Max. :13117 22009 : 2242 Max. :13117
(Other):969149
TRIPS FlowNoIntra offset dist
Min. : 1.00 Min. : 0.00 Min. :0.000001 Min. : 0
1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.:1.000000 1st Qu.: 2030
Median : 14.00 Median : 14.00 Median :1.000000 Median : 3994
Mean : 97.22 Mean : 97.05 Mean :0.995711 Mean : 5027
3rd Qu.: 53.00 3rd Qu.: 53.00 3rd Qu.:1.000000 3rd Qu.: 7083
Max. :96630.00 Max. :96630.00 Max. :1.000000 Max. :24758
ORIGIN_SZ_NAME ORIGIN_SZ DESTIN_SZ_NAME DESTIN_SZ
Length:985230 Length:985230 Length:985230 Length:985230
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ORIGIN_AGE7_12 ORIGIN_AGE13_24 ORIGIN_AGE25_64 DESTIN_AGE7_12
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.: 180 1st Qu.: 410 1st Qu.: 1860 1st Qu.: 90.0
Median : 730 Median : 1580 Median : 7370 Median : 620.0
Mean :1088 Mean : 2434 Mean :11264 Mean : 986.4
3rd Qu.:1610 3rd Qu.: 3470 3rd Qu.:16780 3rd Qu.:1480.0
Max. :6340 Max. :16380 Max. :74610 Max. :6340.0
NA's :27 NA's :27 NA's :27 NA's :14
DESTIN_AGE13_24 DESTIN_AGE25_64
Min. : 0 Min. : 0
1st Qu.: 170 1st Qu.: 900
Median : 1300 Median : 6190
Mean : 2200 Mean :10207
3rd Qu.: 3290 3rd Qu.:15830
Max. :16380 Max. :74610
NA's :14 NA's :14
The print report above reveals that variables ORIGIN_AGE7_12, ORIGIN_AGE13_24, ORIGIN_AGE25_64,DESTIN_AGE7_12, DESTIN_AGE13_24, DESTIN_AGE25_64 consist of 0 values.
flow_data1$DESTIN_AGE7_12 <- ifelse(
flow_data1$DESTIN_AGE7_12 == 0,
0.99, flow_data1$DESTIN_AGE7_12)
flow_data1$DESTIN_AGE13_24 <- ifelse(
flow_data1$DESTIN_AGE13_24 == 0,
0.99, flow_data1$DESTIN_AGE13_24)
flow_data1$DESTIN_AGE25_64 <- ifelse(
flow_data1$DESTIN_AGE25_64 == 0,
0.99, flow_data1$DESTIN_AGE25_64)
flow_data1$ORIGIN_AGE7_12 <- ifelse(
flow_data1$ORIGIN_AGE7_12 == 0,
0.99, flow_data1$ORIGIN_AGE7_12)
flow_data1$ORIGIN_AGE13_24 <- ifelse(
flow_data1$ORIGIN_AGE13_24 == 0,
0.99, flow_data1$ORIGIN_AGE13_24)
flow_data1$ORIGIN_AGE25_64 <- ifelse(
flow_data1$ORIGIN_AGE25_64 == 0,
0.99, flow_data1$ORIGIN_AGE25_64)summary(flow_data1) ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS
Min. : 52 54009 : 3448 Min. : 52 Length:985230
1st Qu.: 5916 84009 : 3132 1st Qu.: 6056 Class :character
Median : 7473 46009 : 2565 Median : 7494 Mode :character
Mean : 7293 17009 : 2349 Mean : 7287
3rd Qu.: 8651 60179 : 2345 3rd Qu.: 8517
Max. :13117 22009 : 2242 Max. :13117
(Other):969149
TRIPS FlowNoIntra offset dist
Min. : 1.00 Min. : 0.00 Min. :0.000001 Min. : 0
1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.:1.000000 1st Qu.: 2030
Median : 14.00 Median : 14.00 Median :1.000000 Median : 3994
Mean : 97.22 Mean : 97.05 Mean :0.995711 Mean : 5027
3rd Qu.: 53.00 3rd Qu.: 53.00 3rd Qu.:1.000000 3rd Qu.: 7083
Max. :96630.00 Max. :96630.00 Max. :1.000000 Max. :24758
ORIGIN_SZ_NAME ORIGIN_SZ DESTIN_SZ_NAME DESTIN_SZ
Length:985230 Length:985230 Length:985230 Length:985230
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ORIGIN_AGE7_12 ORIGIN_AGE13_24 ORIGIN_AGE25_64 DESTIN_AGE7_12
Min. : 0.99 Min. : 0.99 Min. : 0.99 Min. : 0.99
1st Qu.: 180.00 1st Qu.: 410.00 1st Qu.: 1860.00 1st Qu.: 90.00
Median : 730.00 Median : 1580.00 Median : 7370.00 Median : 620.00
Mean :1088.08 Mean : 2434.32 Mean :11264.18 Mean : 986.60
3rd Qu.:1610.00 3rd Qu.: 3470.00 3rd Qu.:16780.00 3rd Qu.:1480.00
Max. :6340.00 Max. :16380.00 Max. :74610.00 Max. :6340.00
NA's :27 NA's :27 NA's :27 NA's :14
DESTIN_AGE13_24 DESTIN_AGE25_64
Min. : 0.99 Min. : 0.99
1st Qu.: 170.00 1st Qu.: 900.00
Median : 1300.00 Median : 6190.00
Mean : 2200.29 Mean :10207.40
3rd Qu.: 3290.00 3rd Qu.:15830.00
Max. :16380.00 Max. :74610.00
NA's :14 NA's :14
Origin (Production) constrained SIM
from the summary table above we noted tehre are NA’s.
flow_data1[is.na(flow_data1$ORIGIN_AGE7_12),]# A tibble: 27 × 18
# Groups: ORIGIN_GRID, ORIGIN_BS [3]
ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS TRIPS FlowNoIntra offset dist
<int> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 4224 46239 5078 46069 1 1 1 6344.
2 4224 46239 5125 46088 2 2 1 6668.
3 4224 46239 5125 46088 2 2 1 6668.
4 4224 46239 5126 46109 9 9 1 6175
5 4224 46239 5033 46219 3381 3381 1 4585.
6 5033 46211 5126 46109 14 14 1 1720.
7 5033 46211 5033 46219 2 0 0.000001 0
8 5033 46211 4224 46239 5676 5676 1 4585.
9 5033 46219 4825 44049 3 3 1 9030.
10 5033 46219 4825 44049 3 3 1 9030.
# ℹ 17 more rows
# ℹ 10 more variables: ORIGIN_SZ_NAME <chr>, ORIGIN_SZ <chr>,
# DESTIN_SZ_NAME <chr>, DESTIN_SZ <chr>, ORIGIN_AGE7_12 <dbl>,
# ORIGIN_AGE13_24 <dbl>, ORIGIN_AGE25_64 <dbl>, DESTIN_AGE7_12 <dbl>,
# DESTIN_AGE13_24 <dbl>, DESTIN_AGE25_64 <dbl>
For the purpose of this exercise where we just want data within the specific hexagon areas, we will drop the NA data, and remember to remove the those whiten the same hexagon (dist = 0)
flow_data1 <- flow_data1 %>%
drop_na() %>%
filter(dist >0)summary(flow_data1) ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS
Min. : 52 54009 : 3416 Min. : 52 Length:980967
1st Qu.: 5920 84009 : 3123 1st Qu.: 6056 Class :character
Median : 7492 46009 : 2556 Median : 7494 Mode :character
Mean : 7295 60179 : 2345 Mean : 7289
3rd Qu.: 8651 17009 : 2331 3rd Qu.: 8517
Max. :13117 22009 : 2242 Max. :13117
(Other):964954
TRIPS FlowNoIntra offset dist
Min. : 1.00 Min. : 1.00 Min. :1 Min. : 325
1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.:1 1st Qu.: 2030
Median : 14.00 Median : 14.00 Median :1 Median : 3994
Mean : 97.29 Mean : 97.29 Mean :1 Mean : 5049
3rd Qu.: 53.00 3rd Qu.: 53.00 3rd Qu.:1 3rd Qu.: 7128
Max. :75452.00 Max. :75452.00 Max. :1 Max. :24758
ORIGIN_SZ_NAME ORIGIN_SZ DESTIN_SZ_NAME DESTIN_SZ
Length:980967 Length:980967 Length:980967 Length:980967
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ORIGIN_AGE7_12 ORIGIN_AGE13_24 ORIGIN_AGE25_64 DESTIN_AGE7_12
Min. : 0.99 Min. : 0.99 Min. : 0.99 Min. : 0.99
1st Qu.: 180.00 1st Qu.: 410.00 1st Qu.: 1860.00 1st Qu.: 90.00
Median : 730.00 Median : 1580.00 Median : 7370.00 Median : 620.00
Mean :1086.96 Mean : 2431.61 Mean :11253.33 Mean : 985.03
3rd Qu.:1610.00 3rd Qu.: 3470.00 3rd Qu.:16780.00 3rd Qu.:1480.00
Max. :6340.00 Max. :16380.00 Max. :74610.00 Max. :6340.00
DESTIN_AGE13_24 DESTIN_AGE25_64
Min. : 0.99 Min. : 0.99
1st Qu.: 170.00 1st Qu.: 900.00
Median : 1300.00 Median : 6120.00
Mean : 2196.56 Mean :10191.93
3rd Qu.: 3290.00 3rd Qu.:15830.00
Max. :16380.00 Max. :74610.00
Unconstrained Spatial Interaction Model
uncSIM <- glm(formula = TRIPS ~
log(ORIGIN_AGE7_12) +
log(DESTIN_AGE7_12) +
log(dist),
family = poisson(link = "log"),
data = flow_data1,
na.action = na.exclude)
uncSIM
Call: glm(formula = TRIPS ~ log(ORIGIN_AGE7_12) + log(DESTIN_AGE7_12) +
log(dist), family = poisson(link = "log"), data = flow_data1,
na.action = na.exclude)
Coefficients:
(Intercept) log(ORIGIN_AGE7_12) log(DESTIN_AGE7_12)
10.25035 0.09730 -0.01915
log(dist)
-0.78412
Degrees of Freedom: 980966 Total (i.e. Null); 980963 Residual
Null Deviance: 370600000
Residual Deviance: 308100000 AIC: 312500000
R-squared function
CalcRSquared <- function(observed,estimated){
r <- cor(observed,estimated)
R2 <- r^2
R2
}CalcRSquared(uncSIM$data$TRIPS, uncSIM$fitted.values)[1] 0.02258983
r2_mcfadden(uncSIM)# R2 for Generalized Linear Regression
R2: 0.167
adj. R2: 0.167
Origin (Production) constrained SIM
orcSIM <- glm(formula = TRIPS ~
ORIGIN_GRID +
log(DESTIN_AGE7_12) +
log(dist),
family = poisson(link = "log"),
data = flow_data1,
na.action = na.exclude)
summary(orcSIM)
Call:
glm(formula = TRIPS ~ ORIGIN_GRID + log(DESTIN_AGE7_12) + log(dist),
family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.089e+01 8.407e-04 12959.7 <2e-16 ***
ORIGIN_GRID -4.072e-05 4.995e-08 -815.2 <2e-16 ***
log(DESTIN_AGE7_12) 9.087e-03 3.737e-05 243.2 <2e-16 ***
log(dist) -7.755e-01 1.017e-04 -7622.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 370603551 on 980966 degrees of freedom
Residual deviance: 312431320 on 980963 degrees of freedom
AIC: 316887141
Number of Fisher Scoring iterations: 7
We can examine how the constraints hold for destinations this time
CalcRSquared(orcSIM$data$TRIPS, orcSIM$fitted.values)[1] 0.02050437
Destination constrained
decSIM <- glm(formula = TRIPS ~
DESTIN_GRID +
log(ORIGIN_AGE7_12) +
log(dist),
family = poisson(link = "log"),
data = flow_data1,
na.action = na.exclude)
summary(decSIM)
Call:
glm(formula = TRIPS ~ DESTIN_GRID + log(ORIGIN_AGE7_12) + log(dist),
family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.039e+01 8.519e-04 12195.4 <2e-16 ***
DESTIN_GRID -4.864e-05 5.027e-08 -967.6 <2e-16 ***
log(ORIGIN_AGE7_12) 9.860e-02 4.542e-05 2170.7 <2e-16 ***
log(dist) -7.713e-01 1.009e-04 -7640.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 370603551 on 980966 degrees of freedom
Residual deviance: 307400985 on 980963 degrees of freedom
AIC: 311856806
Number of Fisher Scoring iterations: 7
We can examine how the constraints hold for destinations this time.
CalcRSquared(decSIM$data$TRIPS, decSIM$fitted.values)[1] 0.02321934
Doubly constrained
dbcSIM <- glm(formula = TRIPS ~
ORIGIN_GRID +
DESTIN_GRID +
log(dist),
family = poisson(link = "log"),
data = flow_data1,
na.action = na.exclude)
summary(dbcSIM)
Call:
glm(formula = TRIPS ~ ORIGIN_GRID + DESTIN_GRID + log(dist),
family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.093e+01 8.178e-04 13368.8 <2e-16 ***
ORIGIN_GRID -7.140e-05 1.147e-07 -622.4 <2e-16 ***
DESTIN_GRID 3.658e-05 1.153e-07 317.2 <2e-16 ***
log(dist) -7.797e-01 1.016e-04 -7676.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 370603551 on 980966 degrees of freedom
Residual deviance: 312390431 on 980963 degrees of freedom
AIC: 316846252
Number of Fisher Scoring iterations: 7
Model comparison
Using compare_performance() of performance package
model_list <- list(unconstrained=uncSIM,
originConstrained=orcSIM,
destinationConstrained=decSIM,
doublyConstrained=dbcSIM)compare_performance(model_list,
metrics = "RMSE")# Comparison of Model Performance Indices
Name | Model | RMSE
----------------------------------------
unconstrained | glm | 520.042
originConstrained | glm | 520.589
destinationConstrained | glm | 519.871
doublyConstrained | glm | 520.656
This compute the RMSE of all the models in model_list file.
The print above reveals that all the SIMs are of similar value.
Visualising fitted
We will extract the fitted values from each model, join the values to data_flow1 data frame, for all the models.
df <- as.data.frame(uncSIM$fitted.values) %>%
round(digits = 0)
flow_data1 <- flow_data1 %>%
cbind(df) %>%
rename(uncTRIPS = "uncSIM$fitted.values")df <- as.data.frame(orcSIM$fitted.values) %>%
round(digits = 0)
flow_data1 <- flow_data1 %>%
cbind(df) %>%
rename(orcTRIPS = "orcSIM$fitted.values")df <- as.data.frame(decSIM$fitted.values) %>%
round(digits = 0)
flow_data1 <- flow_data1 %>%
cbind(df) %>%
rename(decTRIPS = "decSIM$fitted.values")df <- as.data.frame(dbcSIM$fitted.values) %>%
round(digits = 0)
flow_data1 <- flow_data1 %>%
cbind(df) %>%
rename(dbcTRIPS = "dbcSIM$fitted.values")Now we plot it:
unc_p <- ggplot(data = flow_data1,
aes(x = uncTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
orc_p <- ggplot(data = flow_data1,
aes(x = orcTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
dec_p <- ggplot(data = flow_data1,
aes(x = decTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
dbc_p <- ggplot(data = flow_data1,
aes(x = dbcTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
ggarrange(unc_p, orc_p, dec_p, dbc_p,
ncol = 2,
nrow = 2)
From the above, we can draw the conclusion that there isnt much of a correlation between the
Future Work - Train Station
TrainStation <- st_read(dsn="data/geospatial",
layer="RapidTransitSystemStation")%>%
st_transform(crs = 3414)Reading layer `RapidTransitSystemStation' from data source
`C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
I see that TrainStation is a Polygon geometry SF with “TYP_CD”, “STN_NAM”, “TYP_CD_DES”, “STN_NAM_DE”, and its geometry.
Note that “TYP_CD” is of all 0 value, “STN_NAM” is of all NA value, while “TYP_CD_DES” value is “MRT” or “LRT”.
Also note that the above shows a warning message:
Warning: GDAL Message 1: Non closed ring detected. To avoid accepting it, set the OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
invalid_geoms <- TrainStation[!st_is_valid(TrainStation), ]It appears to have 3 invalid geoms, apart form the NA entry, we can see that the both “HARBOURFRONT MRT STATION” and “UPPER THOMSON MRT STATION” are the invalid geometries.
For the purpose of this exercise, I will filter out these entries and retain on the valid geometries in TrainStation.
TrainStation <- TrainStation[st_is_valid(TrainStation), ]%>%
st_transform(crs = 3414)TrainStation_Exit <- st_read(dsn="data/geospatial",
layer="Train_Station_Exit_Layer")%>%
st_transform(crs = 3414)Reading layer `Train_Station_Exit_Layer' from data source
`C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
I see that TrainStation_Exit is a Point geometry SF with “stn_name”, “exit_code”, and its geometry.
I get aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION TRAIN STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.
OD_train <- read_csv("data/aspatial/origin_destination_train_202308.csv") #non-spatial data with no geometry featuresOD_train$ORIGIN_PT_CODE <- as.factor(OD_train$ORIGIN_PT_CODE)
OD_train$DESTINATION_PT_CODE <- as.factor(OD_train$DESTINATION_PT_CODE)Noted that both the aspatial data, for bus and train, are similar where there are total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.
We will now first combine the SF data sources relating to Bus Stop, using st_intersection():
TrainStation_combined <- st_intersection(TrainStation, MPSZ)Before proceeding it will be good practice to check the crs of each using st_crs.
We will now first combine the SF data sources relating to Train Station, using st_intersection():
TrainStation_combined <- st_intersection(TrainStation_combined, TrainStation_Exit) %>%
mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "MRT STATION", "")) %>%
mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "LRT STATION", ""))%>%
select(3:13) %>%
mutate(STN_NAM_DE = str_trim(STN_NAM_DE))Note this data frame does not include the two MRT stations we filtered out earlier.
Note that we have removed the first two column (NA and 0 values), in order to associate the train station code, we will import the following data set and join them:
TrainStation_Code <- readxl::read_excel("data/aspatial/Train Station Codes and Chinese Names.xls")
TrainStation_Code$mrt_station_english <- toupper(TrainStation_Code$mrt_station_english)TrainStation_combined <- left_join(TrainStation_combined,
TrainStation_Code,
by = c("STN_NAM_DE" = "mrt_station_english"))For reproducibility, will save into a rds file:
write_rds(TrainStation_combined, "data/rds/TrainStation_combined.csv") Next, we are going to append the planning subzone code from TrainStation_combined data frame onto OD_train data frame.
Origin_TrainStation <- left_join(OD_train , TrainStation_combined,
by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
select(1:11,16) Looking at the results, noted there are a number of mismatches due to the origin PT code having multiple values. We will need to look into this and clean up the dataset for better visualiztion.
OD_train_filter <- OD_train %>%
mutate(ORIGIN_PT_CODE = str_replace(ORIGIN_PT_CODE, "/.*", "")) %>%
mutate(DESTINATION_PT_CODE = str_replace(DESTINATION_PT_CODE, "/.*", ""))This removes the additional train codes with multiple lines (e.g. Botanic Garden is both Circle line and Downtown line), and only retain one, so to avoid double counting. Then we run the left_join using this dataframe:
Origin_TrainStation_filter <- left_join(OD_train_filter , TrainStation_combined,
by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
select(1:11,16) Own Notes
Constructing an O/D Matrix
In O/D matrix, the Ti sum of row representing total output of origin location, while sum of column represents input of destination.
Can have sub-matrix
Additional new activity may change this to new structure
O/D Matrix can be costly, 322 x 322 = 110k O/D pairs, which each info need to be carefully provided (but may also change).
GPS/ Smart card can collect personal info to represent flows between locations
There are 3 Spatial Interaction Models (First 2 steps)
Basic assumption is that function of attributes of location of origin and destination and frictions
Potential Model is usually for measuring accessibility
Retail model usually used for franchise to choose service area of store of delivery segment
This course we will focus on Gravity Model which is the most common model. It uses Newton first law of gravity. Estimated Tij is transition/trip or flow between origin i (row) and destination j(columns). Parameters V, W, d, k lamda, alpha and beta. Beta is always assumed to be negative as increase in cost.distance will likely decrease the interaction.
A family of gravity Models - Unconstrained (totally constrained), origin constrained, destination constrained, doubly constrained.
We can see that sp is in list, no geometric column, they are segregated as different table inside object, In tidyverse it is in a whole table. But to call a field in sp, we will need to write something like below
select(mpsz@data$SUBZONE)